Efficient Record Linkage Algorithms Using Complete Linkage Clustering.
Abstract
Datasets held by different agencies often contain records of the same individuals. Linking these datasets to identify all the records belonging to the same individual is a crucial and challenging problem, especially given the large volumes of data. Many available record linkage algorithms suffer from either long run times or low accuracy in finding matches and non-matches among the records. In this paper we propose efficient and reliable sequential and parallel algorithms for the record linkage problem based on complete linkage hierarchical clustering. In addition to hierarchical clustering, we use two other techniques: elimination of duplicate records and blocking. Our algorithms use sorting as a subroutine to identify identical copies of records. We have tested our algorithms on datasets with millions of synthetic records. Experimental results show that our algorithms achieve nearly 100% accuracy, and the parallel implementations achieve almost linear speedups. The time complexities of these algorithms do not exceed those of the previous best-known algorithms. Our proposed algorithms outperform the previous best-known algorithms in terms of accuracy while consuming reasonable run times.
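The three ingredients named in the abstract (removing exact duplicates by sorting, blocking, and complete linkage clustering) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the blocking key (first letter of the second field), the summed edit distance between fields, and the threshold of 2 are all assumptions chosen for illustration, and the clustering loop is a naive version with no attention to efficiency or parallelism.

```python
from itertools import groupby


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def record_distance(r, s) -> int:
    """Assumed record distance: sum of field-wise edit distances."""
    return sum(edit_distance(x, y) for x, y in zip(r, s))


def dedupe_by_sorting(records):
    """Sort record indices so identical records become adjacent, then group them."""
    order = sorted(range(len(records)), key=lambda i: records[i])
    return [(rec, list(idxs))
            for rec, idxs in groupby(order, key=lambda i: records[i])]


def block_key(record) -> str:
    """Hypothetical blocking key: first letter of the second field."""
    return record[1][:1]


def complete_linkage(items, threshold):
    """Naive agglomerative clustering. The distance between two clusters is the
    MAXIMUM pairwise record distance; merging stops once the closest pair of
    clusters is farther apart than the threshold."""
    clusters = [[x] for x in items]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(record_distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters


def link_records(records, threshold=2):
    """Return groups of original record indices judged to be the same entity."""
    unique = dedupe_by_sorting(records)
    index_of = {rec: idxs for rec, idxs in unique}
    blocks = {}
    for rec, _ in unique:
        blocks.setdefault(block_key(rec), []).append(rec)
    linked = []
    for recs in blocks.values():
        for cluster in complete_linkage(recs, threshold):
            linked.append(sorted(i for rec in cluster for i in index_of[rec]))
    return sorted(linked)
```

Records are assumed to be tuples of strings such as `("john", "smith")`. Calling `link_records` on a list of such tuples returns lists of positions in the input, one list per linked entity. Because clustering runs only inside a block, records whose blocking keys differ are never compared, which is the usual trade-off of blocking: fewer comparisons at the risk of missing matches that fall into different blocks.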
Similar resources
RLT-S: A Web System for Record Linkage
BACKGROUND Record linkage integrates records across multiple related data sources identifying duplicates and accounting for possible errors. Real life applications require efficient algorithms to merge these voluminous data sources to find out all records belonging to same individuals. Our recently devised highly efficient record linkage algorithms provide best-known solutions to this challengi...
Image Clustering for a Fuzzy Hamming Distance Based CBIR System
A linear search in a Content-Based Image Retrieval (CBIR) system is time consuming. Like any database system an indexing technique is mandatory for the system to be efficient. This paper studies the application of clustering algorithms to a Fuzzy Hamming Distance based CBIR system for building the image index. The study shows good results using complete linkage agglomerative clustering.
Efficient sequential and parallel algorithms for record linkage
BACKGROUND AND OBJECTIVE Integrating data from multiple sources is a crucial and challenging problem. Even though there exist numerous algorithms for record linkage or deduplication, they suffer from either large time needs or restrictions on the number of datasets that they can integrate. In this paper we report efficient sequential and parallel algorithms for record linkage which handle any n...
A Survey on Efficient Clustering Methods with Effective Pruning Techniques for Probabilistic Graphs
This paper provides a survey of K-NN queries, DCR queries, agglomerative complete linkage clustering, an extension of the edit-distance-based definition graph algorithm, and solving decision problems under uncertainty. The existing system gives a starting point for graph agglomeration, which aims to divide information into clusters according to their similarities, and a variety of algorithms are planned for agglomeration gr...
Journal:
- PloS one
Volume 11, Issue 4
Pages: -
Publication date: 2016